Identification-detection group testing protocols for COVID-19 at high prevalence

Group testing allows saving chemical reagents, analysis time, and costs, by testing pools of samples instead of individual samples. We introduce a class of group testing protocols with small dilution, suited to operate even at high prevalence (5–10%), and maximizing the fraction of samples classified positive/negative within the first round of tests. Precisely, if the tested group has exactly one positive sample then the protocols identify it without further individual tests. The protocols also detect the presence of two or more positives in the group, in which case a second round could be applied to identify the positive individuals. With a prevalence of 5% and maximum dilution 6, with 100 tests we classify 242 individuals, 92% of them in one round and 8% requiring a second individual test. In comparison, Dorfman's scheme can test 229 individuals with 100 tests, with a second round for 18.5% of the individuals.


We consider those situations where it is necessary to check if some individuals are positive with respect to a given disease. With a direct approach, samples taken from the individuals can be tested one by one, with a number of tests equal to the number of individuals under test. In many cases, however, it is possible to pool samples taken from different individuals and test the pool: if the pool is negative then all the corresponding individuals are declared as negative, while if the pool is positive it means that at least one is positive. Several group testing (GT) techniques based on pooling to reduce the number of tests have been proposed, starting from the work by Dorfman 1 . When the disease prevalence is not too large, this brings considerable savings in terms of tests and therefore chemical reagents, analysis time, effort, and costs. Recently, due also to the cost of sophisticated tests like those based on polymerase chain reaction or transcription mediated amplification, the use of group testing has been advocated to enable mass screening in the context of the SARS-CoV-2 pandemic, with experimental campaigns implemented in a few countries 2 . In adaptive group testing, the tests are performed in sequence, with pools that are created based on the outcomes of the previous tests 3,4 . On the contrary, in non-adaptive group testing all pools are a-priori set, and tests are carried out in parallel. Both approaches have advantages and shortcomings: adaptive strategies can identify the status of individuals with fewer tests. Nevertheless, considering the time required to carry out each test, a pure adaptive strategy may require an excessive amount of time. Non-adaptive schemes require typically more tests to succeed, but they are faster as tests can be performed in parallel. 
To combine the advantages of both techniques, while mitigating their limitations, it is sometimes preferable to implement a hybrid approach, where a first screening is performed via a non-adaptive testing step, followed by an adaptive (or even individual) one for the population members that are identified as potentially infected. Approaches of this kind, which date back to the original work of Dorfman 1 , enable remarkable savings in the number of tests. Several current investigations on the use of group testing for SARS-CoV-2 screening follow this line 5-12 . In particular, in the context of group testing for SARS-CoV-2, the simple Dorfman approach has been validated by verifying the sensitivity of polymerase chain reaction tests with respect to the size n of the pools 6,7 . Non-adaptive protocols relying on Reed-Solomon error correcting codes to design the pools have been used to target the low infection rate regime (e.g., prevalence below 1.3%) 8 . Bayesian approaches to identify the set of infected samples in a non-adaptive group testing approach have also been addressed 9 , as well as schemes that exploit quantitative knowledge of the viral load in the pools 10,11 . Other approaches to group testing in the low prevalence regime exploit a geometrical construction of the pools in a non-adaptive setting 12 .
Differently from previous works, we are not limited here to low prevalence. Specifically, in this paper we describe a new class of protocols for group testing where the main objective is to maximize the probability that classification of samples is completed within the first round of tests. If the tested group has exactly one positive sample, then the protocols detect that there is only one positive and identify it without the need for a second round of individual testing. The protocols also detect the presence (without identification) of two or more positives in the group, in which case a second round must be applied to identify the positive individuals. These protocols are thus analogous to error control codes able to correct one error and detect two or more errors occurring in a group 13 . Due to this capability to directly identify one positive in the first round, this work is especially suited to high prevalence (5-10%) scenarios, differently from other methods which address the low prevalence case 2,6-8,12 . Also, due to the problem of dilution which can lead to false negatives, this work is particularly focused on those pool sizes which can be realistically used in a diagnostic laboratory 2,5,7,14-16 . We present next the main results of the investigation, based on the probabilistic analysis detailed at the end of the paper.
www.nature.com/scientificreports/

Results
In the following we will refer to polymerase chain reaction (PCR) for the test, but the procedure is general for any possible test. By "prevalence" we indicate the probability that an individual is positive. The direct approach to testing consists of performing individual tests, with one PCR for each individual sample, to determine if it is positive or negative. The number of tests in this case equals the number of samples to classify. In group testing (GT), individual samples are grouped (pooling), and the pools are tested: if a pool is negative it is assumed that all individuals participating in that pool are negative. Thus, the number of PCR tests can be reduced with respect to individual testing, if the prevalence is not too high. The saving is more marked for low prevalence. In this paper we will refer to GT with a first round of pooled tests, possibly followed by a second round of some (hopefully few) individual tests to complete the classification. While the advantage is clear, it must be considered that implementing GT imposes a reorganization of the testing process, whose impact should not be underestimated. In fact, with GT a phase of preparation of the pools is necessary. This phase should be automated to avoid errors in the processing: this is already possible, as machines currently available in many diagnostic laboratories can be suitably reprogrammed for pooling. Also, while individual testing ends in a single round of PCR, in the case of GT it is sometimes necessary to carry out a second round of PCRs for some individuals (thus requiring additional time). If the number of samples to retest is large, managing the second round, where individual samples needing an individual PCR must be reexamined, should be automated to avoid errors and contamination. When full automation of the process is not available, it is preferable to adopt GT schemes with a low fraction of samples needing an individual retest.
Also, it must be remarked that large pool sizes can lead to a dilution of the viral load affecting the sensitivity of the test, therefore causing false negatives. For this reason, we will concentrate on schemes with limited pool sizes, which justifies our assumption that the false negative rate is negligible.
The baseline protocol is that originally proposed by Dorfman in 1943, where: individuals are grouped into groups of n; one pool is used to analyze all n individuals (dilution n); the mother tubes of the n individuals are set aside; the single pool is tested. If the pool is negative, all n individuals are declared negative, and no other tests are needed. If, on the contrary, the pool is positive, it is necessary to carry out a second round of individual tests on all n individuals 1 .
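The two rounds of the baseline protocol can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `dorfman_group` is hypothetical, and the true statuses are passed in solely to emulate ideal, error-free test outcomes (no dilution effects, no false negatives), as assumed in the text.

```python
from typing import List, Tuple


def dorfman_group(statuses: List[bool]) -> Tuple[List[bool], int]:
    """Classify one group of n individuals with Dorfman's two-round scheme.

    statuses[i] is the true status of individual i; it is used only to
    emulate ideal test outcomes (assumption: no false negatives/positives).
    Returns the declared statuses and the number of tests spent.
    """
    tests = 1                      # first round: a single pooled test
    pool_positive = any(statuses)  # the pool is positive if anyone is positive
    if not pool_positive:
        # Negative pool: all n individuals are declared negative.
        return [False] * len(statuses), tests
    # Positive pool: second round of individual tests on all n individuals.
    tests += len(statuses)
    return list(statuses), tests
```

For a group of n = 4, a negative pool costs one test, while a positive pool costs 1 + 4 = 5 tests.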
Identification-detection for group testing: the Pnp protocols. We propose a new class of pooling schemes with small dilution, for high prevalence testing scenarios. Assume a group test employing p pools $P_1, P_2, \ldots, P_p$ to test a group of n > p individuals $I_1, I_2, \ldots, I_n$. The pooling can be described by a test matrix, where each row is a pool and each column is an individual. The matrix elements are 0 or 1, where a 1 in row i and column j indicates that individual $I_j$ participates in pool $P_i$.
For identification-detection we propose to use test matrices whose columns all contain a fixed number c of 1s, so that each individual sample is copied into exactly c pools. With this choice, there are exactly c positive pools if and only if the group has exactly one positive individual. A number of positive pools larger than c indicates that there are two or more positive individuals in the group. The largest group size n for a given number of copies c and pools p is

$$n = \binom{p}{c}. \tag{1}$$

We will always assume the largest n, as for GT the objective is to test the largest possible number of individuals for a given p. The pooling matrix columns are thus all possible vectors with p − c elements equal to 0 and c elements equal to 1. The number of individuals per pool (dilution) is indicated as d. It can be checked that the dilution for p pools and n individuals, each participating in c pools, is

$$d = \frac{n c}{p} = \binom{p-1}{c-1}. \tag{2}$$

Hence, by testing the p pools (first round), the scheme allows immediate classification of the cases of zero positives per group or one positive per group. Therefore, for up to one positive per group there is no need for individual tests. The scheme also detects the presence of two or more positives per group, in which case a second round of individual tests is required.
To keep the dilution as small as possible, we investigate in particular the case c = 2, where each sample is copied into two pools. With this choice, from Eq. (2) the dilution is d = p − 1. We now give the test matrices explicitly for dilutions up to d = 6.
The P64 protocol uses p = 4 pools, with each individual participating in c = 2 pools, and therefore with a number of individuals per group n = 6. The test matrix for P64 is reported in Fig. 1. Each pool contains the samples from exactly three individuals, so the dilution is d = 3, as given by Eq. (2). The protocol is described as follows: individuals are arranged into groups of n = 6 (indicated in the figure as $I_1, \ldots, I_6$); p = 4 pools are used to analyze the 6 individuals; each individual participates in c = 2 pools according to the scheme in the figure, with exactly 3 individuals in each pool; the mother tubes of the 6 individuals are set aside; the four pools are tested (e.g., by PCR).
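The construction of the test matrix and the first-round decision rule can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the column ordering produced by `itertools.combinations` need not match the ordering used in Fig. 1, and the pooled tests are assumed ideal (a pool is positive if and only if it contains at least one positive sample).

```python
from itertools import combinations
from typing import List, Set, Tuple


def pooling_matrix(p: int, c: int = 2) -> List[List[int]]:
    """Rows are pools, columns are individuals.

    Columns are all binary vectors of length p with exactly c ones, so the
    group size is the binomial coefficient C(p, c).
    """
    columns = list(combinations(range(p), c))  # pools joined by each individual
    return [[1 if i in col else 0 for col in columns] for i in range(p)]


def first_round(matrix: List[List[int]], positives: Set[int]) -> List[int]:
    """Ideal pooled outcomes (assumption: no false negatives or positives)."""
    return [1 if any(row[j] for j in positives) else 0 for row in matrix]


def decode(matrix: List[List[int]], outcomes: List[int], c: int = 2) -> Tuple[str, List[int]]:
    """Return a verdict and the individuals it concerns.

    'all_negative': no positive pool.
    'one_positive': exactly c positive pools, the single positive is identified.
    'retest': more than c positive pools, listed individuals need a second round.
    """
    p, n = len(matrix), len(matrix[0])
    n_positive_pools = sum(outcomes)
    if n_positive_pools == 0:
        return "all_negative", []
    # Individuals all of whose pools tested positive.
    candidates = [j for j in range(n)
                  if all(outcomes[i] for i in range(p) if matrix[i][j])]
    if n_positive_pools == c:
        return "one_positive", candidates
    return "retest", candidates
```

With `pooling_matrix(4)` this yields a 4 × 6 matrix with three individuals per pool, matching the P64 parameters n = 6, p = 4, d = 3.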
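Based on the results of the 4 tests, the following cases may arise (see Table 1):
• no positive pools: all 6 individuals are negative;
• exactly 2 positive pools: exactly one individual is positive, namely the one participating in both positive pools;
• 3 or 4 positive pools: two or more individuals are positive, and a second round of individual tests is needed on the individuals whose two pools are both positive.

Performance. We present the performance of our protocol compared with Dorfman's protocol, for dilutions ranging from d = 3 up to d = 6. Considering both the first round of tests on p pools and the occasional second round on some or all individuals, for a generic protocol we define the performance in terms of:
• efficiency, quantified by the average number of individuals tested with 100 PCR tests;
• probability that an individual is tested in a second round, indicated as $P_{SR}$.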
We assume a prevalence ε and independent positivity from individual to individual. To calculate efficiency, let us denote with $T_g$ the number of tests needed to identify all positives in the group. By indicating with E{·} the statistical expectation, the average number of tests per individual is $E\{T_g\}/n$. Therefore, on average, with 100 PCRs we classify a number of individuals equal to

$$\frac{100\, n}{E\{T_g\}}. \tag{3}$$

The statistical characterization of $T_g$ is provided in the Methods section, and leads to Eq. (9). As for $P_{SR}$, we observe that a second round is needed if the number of positive pools, indicated as Y, is greater than 2. The statistics of Y, provided in the Methods section, lead to Eq. (8).
Results as functions of the prevalence are shown in Figs. 2 and 3, where the probability of a second test for an individual, given by Eqs. (6) and (8), and the efficiency, given by Eqs. (3) and (9), are reported. For example, with P64 we find that, with a prevalence ε = 5%, about 146 individuals are classified with 100 PCR tests. Of all individuals, 98% are classified in the first round, and only 2% need a second round. For Dorfman's scheme, the corresponding quantities are given by Eqs. (10) and (11). The exact numbers are shown for prevalence of 5% and 10% in Table 2. For example, with a pool of n = 4 individuals and a prevalence ε = 5%, an average of 229 individuals are tested with 100 PCR tests with Dorfman's scheme. However, about 18.5% of all individuals need a second round of individual tests.
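As a rough arithmetic check of the P64 figures quoted above, one can plug the rounded second-round probability of about 2% into the expression 100/(p/n + P_SR) for the number of individuals tested with 100 tests, which is given in the Methods section. The rounding of the intermediate values is ours.

```latex
\frac{100}{p/n + P_{SR}} \approx \frac{100}{4/6 + 0.02} \approx \frac{100}{0.687} \approx 146
```

The first term in the denominator is the fixed cost of the first round (4 pooled tests for 6 individuals), and the second term is the average cost of the occasional individual retest.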

Discussion
We have investigated group testing consisting of a first round of pooled tests, followed by individual testing, applied to a population with high prevalence, to save resources.
Two main issues must be discussed for a practical usage of group testing. First, one should consider that the maximum pool size is limited due to the dilution of the sample viral load and the consequent problem of false negatives. For COVID-19, a conservative current estimate suggests that dilutions on the order of 5-8 would still allow a negligible false negative rate, although higher dilutions have been investigated, with some conflicting reports 2,5,7,14-16 . Second, for group testing the diagnostic laboratory must be organized to handle the whole process (pooling, first round of tests, reopening and second round of tests). Automation systems and robots, currently available in many diagnostic laboratories, can be suitably reprogrammed for pooling. The main issue is related to the management of the second round of tests. If the fraction of samples to retest is large, retrieving the original samples of the individuals to be reexamined should be automated, to avoid errors and contamination. When the process is not fully automated it may be necessary to use protocols able to complete the positive identification mostly within the first round of tests, with a small rate of individuals to be retested. In fact, with low rates of second rounds it may be possible to handle the retesting process even manually, thus simplifying the organization of a diagnostic laboratory. Reducing the second round tests also has the advantage of giving a faster classification.
Limiting the discussion to dilutions up to 6, we have found that in the range of prevalence 5-10% the best choice is the identification-detection scheme P217, which outperforms all the others in terms of efficiency while still having a low rate of second round tests. For example, at ε = 5% it allows classifying 242 individuals with 100 tests, with a rate of second round individual tests of about 8%. The scheme also performs better than Dorfman's scheme both in terms of efficiency and rate of second round individual tests. Compared with the proposal, Dorfman's schemes are in fact less efficient and have much larger rates of individuals tested twice (once in a group, then individually). They are therefore not suitable at high prevalence. Identification-detection schemes with dilutions 3-5 offer smaller savings in the number of tests, but a smaller rate of second round tests. The scheme P64, with dilution 3, is the one with the smallest rate of retested individuals, having a rate of individual retest of 0.088% at ε = 1%, of 2% at ε = 5%, and of 7% at ε = 10%. The choice of the specific protocol therefore imposes, for a given maximum dilution, a trade-off between efficiency and second round rates.
Having to handle few second round tests, all new protocols seem suitable even for non-automated laboratories at prevalence up to 5%, with a few percent of the individuals needing a second round. At prevalence 10% the only protocols with a small probability of second round are P105 and P64, with 12.7% and 7% of the individuals needing retesting, respectively.
We also derived analytical expressions for the performance of the new protocols, allowing the design of identification-detection group testing pooling schemes for arbitrary dilutions and for the targeted prevalence rates.

Methods
Performance of the Pnp protocols. In this section we derive expressions for the performance of the proposed protocols assuming c = 2 , which is the most effective value of c to limit dilution. The analysis can however be generalized to other values of c.
Let us denote as X the number of positive individuals in a group of n individuals, and as Y the number of positive pools out of the p pools. For an identification-detection protocol able to identify one single positive in a group of n and detect two or more positives, the probability that the group must be reopened for a second round is

$$\Pr\{X \ge 2\} = 1 - (1-\epsilon)^n - n\,\epsilon\,(1-\epsilon)^{n-1}. \tag{4}$$

The average number of tests per group can be bounded by assuming that the second round is performed on all n individuals:

$$E\{T_g\} \le p + n \Pr\{X \ge 2\}. \tag{5}$$

For a precise analysis we must consider that the second round occurs on subsets of the group, depending on the number of positive pools. To this aim, we observe that the probability that y pools are positive is

$$\Pr\{Y = y\} = \sum_{x=0}^{n} a(x, y)\, \epsilon^{x} (1-\epsilon)^{n-x}, \tag{6}$$

where a(x, y) is the number of group configurations with x positive individuals and y positive pools, so that, for example, a(1, 2) = n. The values of a(x, y) for arbitrary x, y can be derived by combinatorial analysis. Specifically, we prove at the end of the paper that, for x ≥ 1, a(x, y) is given by the recursion

$$a(x, y) = \binom{p}{y} \binom{y(y-1)/2}{x} - \sum_{\ell=1}^{y-1} \binom{p-\ell}{y-\ell}\, a(x, \ell). \tag{7}$$

Values of a(x, y) needed to evaluate Pr{Y = y} are those for y > 2, which are reported for some protocols of interest in Tables 3, 4, 5 and 6.
Then, we observe that if there are Y = y positive pools the number of individuals to retest in a second round is $\binom{y}{2}$. Therefore, the probability that an individual needs a second round is

$$P_{SR} = \frac{1}{n} \sum_{y=3}^{p} \binom{y}{2} \Pr\{Y = y\}. \tag{8}$$

The exact average number of tests needed to classify all n individuals in a group is then

$$E\{T_g\} = p + n P_{SR}, \tag{9}$$

and the number of individuals tested with 100 tests, defined by (3), is rewritten as $100/(p/n + P_{SR})$.

Table 5. a(x, y) for P156.

Table 6. a(x, y) for P217.

Performance of Dorfman's scheme. For completeness, we also review the performance of Dorfman's protocol 1 . Under the same hypotheses as above, the probability that a second round is needed for Dorfman's scheme (in this case all individuals have to be retested) is

$$P_{SR} = 1 - (1-\epsilon)^n, \tag{10}$$

and the average number of tests per group is

$$E\{T_g\} = 1 + n P_{SR}. \tag{11}$$

The number of individuals tested with 100 tests is then $100/(1/n + P_{SR})$.
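The Dorfman figures quoted in the Results can be reproduced with a few lines of Python. This is a sketch under the assumptions stated in the text (independent positivity with prevalence `eps`, ideal tests); the function name `dorfman_performance` is hypothetical. It uses the fact that a group of n is retested whenever at least one member is positive, and that each group costs one pooled test plus n individual tests in that case.

```python
from typing import Tuple


def dorfman_performance(n: int, eps: float) -> Tuple[float, float]:
    """Second-round probability and individuals classified per 100 tests.

    n   : pool size (equal to the dilution for Dorfman's scheme)
    eps : prevalence, i.e. probability that an individual is positive
    """
    # A second round is needed if at least one of the n individuals is positive.
    p_sr = 1.0 - (1.0 - eps) ** n
    # One pooled test, plus n individual tests when the pool is positive.
    tests_per_group = 1.0 + n * p_sr
    # Equivalent to 100 / (1/n + P_SR).
    individuals_per_100 = 100.0 * n / tests_per_group
    return p_sr, individuals_per_100


p_sr, per_100 = dorfman_performance(n=4, eps=0.05)
print(f"P_SR = {p_sr:.3f}, individuals per 100 tests = {per_100:.1f}")
```

For n = 4 and a prevalence of 5% this gives a second-round probability of about 18.5% and between 229 and 230 individuals per 100 tests, in line with the values reported for Dorfman's scheme.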

Recursive computation of a(x, y).
When c = 2, the pooling matrix is amenable to a simple graphical description. In particular, a pooling matrix with p rows and n columns can be represented as a graph G with p vertices, each one associated with a matrix row (equivalently, with a pool), and n edges, each one associated with a matrix column (equivalently, with an individual). An edge connects two vertices if and only if the individual corresponding to the edge participates in the two pools corresponding to the vertices. An example is provided in Fig. 4 for the P64 pooling matrix. The value of a(x, y) equals the number of sub-graphs of G having y vertices (with nonzero degree) and x edges. Since the pooling matrix columns are all length-p binary vectors with two 1s, we can focus on a specific subset S of vertices with cardinality y and search for the number of sub-graphs of G with y vertices (with nonzero degree) and x edges, all vertices belonging to S. This number is denoted by ã(x, y) and is related to a(x, y) by

$$a(x, y) = \binom{p}{y}\, \tilde{a}(x, y). \tag{12}$$

The value of ã(x, y) equals the number of ways in which, given the set S of y vertices, we can place x edges in such a way that each vertex is connected to at least one edge. This is equal to the total number of ways in which the x edges can be placed among the y(y − 1)/2 possible edges, $\binom{y(y-1)/2}{x}$, minus the number of edge configurations in which only ℓ nodes are "touched", for ℓ ∈ {1, 2, . . . , y − 1}. For x ≥ 1 this yields

$$\tilde{a}(x, y) = \binom{y(y-1)/2}{x} - \sum_{\ell=1}^{y-1} \binom{y}{\ell}\, \tilde{a}(x, \ell), \tag{13}$$

which in particular gives ã(x, 2) = 1 if x = 1 and ã(x, 2) = 0 otherwise. Incorporating Eq. (13) into Eq. (12) gives, after some simplifications, Eq. (7).
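The counting argument above translates directly into a short recursive routine, from which the second-round probability and the efficiency for c = 2 follow. The sketch below is an illustration, not the authors' code: the function names are hypothetical, positivity is assumed independent with prevalence `eps`, and the recursion is applied only for x ≥ 1 (with no positive individual there is no positive pool).

```python
from functools import lru_cache
from math import comb
from typing import Tuple


@lru_cache(maxsize=None)
def a_tilde(x: int, y: int) -> int:
    """Ways to place x >= 1 distinct edges on y given vertices, touching all y."""
    total = comb(comb(y, 2), x)  # any x of the y(y-1)/2 possible edges
    # Remove configurations touching only l < y of the vertices.
    return total - sum(comb(y, l) * a_tilde(x, l) for l in range(1, y))


def a(x: int, y: int, p: int) -> int:
    """Configurations with x positive individuals and y positive pools."""
    return comb(p, y) * a_tilde(x, y)


def pnp_performance(p: int, eps: float) -> Tuple[float, float]:
    """Second-round probability and individuals per 100 tests, for c = 2."""
    n = comb(p, 2)  # largest group size with two copies per sample

    def pr_y(y: int) -> float:
        return sum(a(x, y, p) * eps**x * (1 - eps) ** (n - x)
                   for x in range(1, n + 1))

    # With y > 2 positive pools, y(y-1)/2 individuals are retested.
    p_sr = sum(comb(y, 2) * pr_y(y) for y in range(3, p + 1)) / n
    return p_sr, 100.0 / (p / n + p_sr)


for p in (4, 5, 6, 7):  # P64, P105, P156, P217
    print(p, pnp_performance(p, 0.05))
```

For P64 (p = 4) at 5% prevalence this returns a second-round probability of about 2% and about 146 individuals per 100 tests, consistent with the values reported in the Results; for P217 (p = 7) it returns about 8% and about 242.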